Introduction to Triton Programming: Beyond Elementwise Operations: The Shift to Tiled Matrix Operations

In the previous lesson we focused on elementwise operations (for example, a basic ReLU over a matrix). These operations are memory-bound: each value loaded takes part in roughly one arithmetic operation, so the GPU spends far more time moving data between High Bandwidth Memory (HBM) and registers than doing arithmetic.
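To make "memory-bound" concrete, the following is a minimal sketch of a back-of-the-envelope cost model, not a measurement. The function name `relu_arithmetic_intensity` is hypothetical, and the model assumes one operation per element, one load and one store per element, and 4-byte (float32) values by default.

```python
def relu_arithmetic_intensity(num_elements: int, bytes_per_element: int = 4) -> float:
    """Operations per byte moved for an elementwise ReLU (assumed cost model)."""
    # Assumption: one max(x, 0) counts as one operation per element.
    operations = num_elements
    # Assumption: every element is loaded once and stored once.
    bytes_moved = 2 * num_elements * bytes_per_element
    return operations / bytes_moved


intensity = relu_arithmetic_intensity(4096 * 4096)
print(intensity)  # 0.125
```

Under this model the ratio does not depend on the matrix size: a larger matrix means proportionally more memory traffic, so the kernel stays limited by memory bandwidth.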

1. Why GEMM Matters

General matrix multiplication (GEMM) on $N \times N$ matrices performs $O(N^3)$ arithmetic operations but touches only $O(N^2)$ data. With enough data reuse, memory latency can therefore be hidden behind very high arithmetic throughput, which is why GEMM is the 'heart' of large language models (LLMs).
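The gap between $O(N^3)$ work and $O(N^2)$ data can be sketched with a simple cost model. This is an idealized estimate, not a benchmark: the hypothetical helper `gemm_arithmetic_intensity` assumes square $N \times N$ matrices, 2 floating-point operations per multiply-add, 4-byte elements, and perfect reuse (each of A and B read once, C written once).

```python
def gemm_arithmetic_intensity(n: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte moved for an n x n GEMM under an idealized reuse model."""
    # n^3 multiply-adds, counted as 2 floating-point operations each.
    flops = 2 * n**3
    # Assumption: A and B are each read once and C is written once.
    bytes_moved = 3 * n**2 * bytes_per_element
    return flops / bytes_moved


# Under this model the intensity is n / 6 for 4-byte elements,
# so it grows linearly with the matrix size.
for n in (256, 1024, 4096):
    print(n, gemm_arithmetic_intensity(n))
```

A real kernel only approaches this bound if it keeps operands on chip between uses, which is what tiling is for.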

2. Representing Two-Dimensional Data in Memory

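Physical memory is one-dimensional. To represent a two-dimensional tensor we use strides: the number of elements to skip in memory to move one step along each dimension. A common pitfall in production, the 'Stride Trap', is assuming that a tensor is contiguous. If you mix up the row stride and the column stride when computing pointers, you will read the wrong data, which corrupts the output, or trigger a memory access fault.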
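The stride arithmetic can be illustrated without a GPU. The sketch below models memory as a flat Python list. The helper `element_offset` and the sample values are illustrative assumptions, and the transposed view is modeled by swapping strides instead of copying data.

```python
def element_offset(row: int, col: int, stride_row: int, stride_col: int) -> int:
    """Flat offset of element (row, col) given per-dimension strides."""
    return row * stride_row + col * stride_col


# A 2 x 3 row-major matrix stored in flat memory; its strides are (3, 1).
flat = [10, 11, 12,
        20, 21, 22]

# A transposed view has shape (3, 2) and strides (1, 3): same memory, no copy.
t_strides = (1, 3)

# Correct: element (1, 0) of the view is element (0, 1) of the original.
right = element_offset(1, 0, *t_strides)

# The trap: row and column strides swapped, i.e. (3, 1) instead of (1, 3).
swapped = element_offset(1, 0, 3, 1)      # in bounds, but the wrong element
out_of_bounds = element_offset(2, 1, 3, 1)  # past the end of the buffer

print(flat[right], flat[swapped], out_of_bounds >= len(flat))  # 11 20 True
```

The first mistake is the dangerous one in practice: the read succeeds and the kernel produces plausible but incorrect numbers. The second would be an out-of-bounds access on a real device.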

3. Extending the Idea to Tiles

Triton extends the elementwise logic by moving from a single pointer to a block of pointers. A 2D tile of pointers is built by broadcasting a column vector of row offsets against a row vector of column offsets. With 2D tiles (for example, $16 \times 16$) we exploit data reuse in fast on-chip SRAM, so the data stays 'hot' for fused work such as adding a bias or applying an activation before it is written back to global memory.
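The tile construction can be emulated in plain Python to show the arithmetic. This is a CPU-side sketch under stated assumptions, not a Triton kernel: `tile_offsets` and `fused_bias_relu_tile` are hypothetical helpers, and the nested list comprehension stands in for the broadcast that a Triton kernel typically writes as `offs_m[:, None] * stride_m + offs_n[None, :] * stride_n`.

```python
def tile_offsets(row_start, col_start, block_m, block_n, stride_row, stride_col):
    """2D grid of flat offsets for a block_m x block_n tile."""
    offs_m = [row_start + i for i in range(block_m)]  # row indices (column vector)
    offs_n = [col_start + j for j in range(block_n)]  # column indices (row vector)
    # Emulates broadcasting the two 1D vectors into a 2D block of offsets.
    return [[m * stride_row + n * stride_col for n in offs_n] for m in offs_m]


def fused_bias_relu_tile(flat, offsets, bias):
    """Read each tile element once, add a per-column bias, apply ReLU, write once."""
    for row in offsets:
        for j, off in enumerate(row):
            value = flat[off] + bias[j]   # bias add while the value is "hot"
            flat[off] = max(value, 0.0)   # activation fused into the same pass


# A 4 x 8 row-major matrix (strides (8, 1)) holding the values -24.0 .. 7.0.
flat = [float(i - 24) for i in range(32)]

# A 2 x 2 tile covering rows 2-3 and columns 4-5.
tile = tile_offsets(2, 4, 2, 2, stride_row=8, stride_col=1)
print(tile)  # [[20, 21], [28, 29]]

fused_bias_relu_tile(flat, tile, bias=[1.0, 10.0])
print([flat[off] for row in tile for off in row])  # [0.0, 7.0, 5.0, 15.0]
```

The point of the fused pass is that each element makes one round trip to memory. Running bias add and ReLU as two separate elementwise kernels would load and store the same data twice.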

QUESTION 1

Why is an elementwise ReLU on a large matrix considered 'memory-bound'?

The ReLU function requires complex transcendental math.

The ratio of arithmetic operations to memory loads is very low (1:1).

Matrices are naturally stored in CPU memory only.

Triton cannot process non-linear activations.

QUESTION 2

What is the result of 'The Stride Trap' in production kernels?

The kernel runs significantly faster but with less precision.

Memory access violations or corrupted output due to incorrect address calculation on non-contiguous tensors.

The GPU automatically corrects the indexing using L2 cache.

The tensor is forced into a 1D shape by the compiler.

QUESTION 3

How does Triton represent a 2D tile of pointers?

By using a nested Python list of integers.

By broadcasting a 1D column vector and a 1D row vector of offsets together.

By launching multiple 1D kernels sequentially.

By allocating a special 2D register file.

QUESTION 4

Which operation benefits most from the O(N³) complexity shift to hide memory latency?

Vector Addition

Matrix Multiplication (GEMM)

Sigmoid Activation

Global Average Pooling

QUESTION 5

List three kernels in your current workflow that launch multiple PyTorch ops and might benefit from fusion.

Linear -> Bias -> ReLU; LayerNorm -> Dropout; Softmax -> Masking.

Print -> Log -> Sleep.

DataLoader -> Augmentation -> Storage.

These ops cannot be fused in Triton.